Using SAS/STAT to implement a multivariate adaptive outlier detection approach to distinguish outliers from extreme values

Author

  • Paulo Macedo
Abstract

Hawkins (1980) defines an outlier as “an observation that deviates so much from other observations as to arouse the suspicion that it was generated by a different mechanism”. To identify outliers, a classic multivariate detection approach applies the Robust Mahalanobis Distance Method, splitting the distribution of distance values into two subsets (within-the-norm and out-of-the-norm). The threshold is usually set to the 97.5% quantile of the Chi-Square distribution with p degrees of freedom, where p is the number of variables, and items whose distance values exceed it are labeled out-of-the-norm. This threshold is arbitrary, however, and it may flag as out-of-the-norm a number of items that are actually extreme values of the baseline distribution rather than outliers. It is therefore desirable to identify an additional threshold, a cutoff point that divides the set of out-of-the-norm points into two subsets: extreme values and outliers. One way to do this, in particular for larger databases, is to increase the threshold to another arbitrary number, but this approach must take the size of the dataset into account, since size affects the threshold separating outliers from extreme values. A 2003 article by Gervini (Journal of Multivariate Analysis) proposes “an adaptive threshold that increases with the number of items n if the data is clean but it remains bounded if there are outliers in the data.” In 2005, Filzmoser, Garrett and Reimann (Computers & Geosciences) built on Gervini’s contribution to derive by simulation a relationship between the number of items n, the number of variables p, and a critical ancillary variable for the determination of outlier thresholds. This paper implements the Gervini adaptive threshold value estimator using PROC ROBUSTREG and the SAS Chi-Square functions CINV and PROBCHI, available in the SAS/STAT environment.
It also provides data simulations to illustrate the reliability and the flexibility of the method in distinguishing true outliers from extreme values.
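
The two-threshold idea in the abstract can be sketched in a few lines. The sketch below is in Python rather than SAS and is not the paper's code: it assumes the squared robust distances have already been computed, and `chi2_quantile` and `chi2_cdf` are hand-written stand-ins for the roles that CINV and PROBCHI play in SAS. The function `adaptive_cutoff`, the value of `P_CRIT`, the strict-inequality flagging rule, and the empirical-quantile convention are all assumptions made for illustration; in the approach described above, the critical value would come from the simulated relationship with n and p.

```python
import math

def chi2_cdf(x, p):
    """Chi-square CDF with p degrees of freedom, via the series for the
    regularized lower incomplete gamma function (all terms positive)."""
    if x <= 0:
        return 0.0
    a, h = p / 2.0, x / 2.0
    term = total = 1.0
    k = 0
    while term > total * 1e-16 and k < 100000:
        k += 1
        term *= h / (a + k)
        total += term
    return min(1.0, total * math.exp(a * math.log(h) - h - math.lgamma(a + 1.0)))

def chi2_quantile(q, p):
    """Chi-square quantile, found by bisection on chi2_cdf."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, p) < q:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, p) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def adaptive_cutoff(d2, p, p_crit, q=0.975):
    """Return (delta, cutoff) for squared distances d2.

    delta is the fixed chi-square threshold. cutoff is math.inf when the
    tail of the empirical distribution stays close to the chi-square tail
    (out-of-the-norm points are then treated as extreme values only).
    p_crit is supplied by the caller (hypothetical parameter)."""
    d = sorted(d2)
    n = len(d)
    delta = chi2_quantile(q, p)
    gap = 0.0
    for i, v in enumerate(d):
        # i / n is the empirical CDF just below v (for ties, the first
        # tied value gives the largest gap, which max() keeps).
        if v > delta:
            gap = max(gap, chi2_cdf(v, p) - i / n)
    if gap <= p_crit:
        return delta, math.inf
    # Assumed convention: empirical (1 - gap) quantile, never below delta.
    k = n - math.ceil(n * gap)
    cutoff = max(d[k - 1], delta) if k >= 1 else delta
    return delta, cutoff

# Constructed data, not results from the paper: 200 squared distances that
# follow the chi-square(2) quantiles exactly, and a copy in which the 20
# largest values are replaced by planted outliers.
n, p = 200, 2
P_CRIT = 0.02  # illustrative value only, not taken from the paper
clean = [chi2_quantile((i + 0.5) / n, p) for i in range(n)]
contaminated = clean[:180] + [60.0 + j for j in range(20)]

delta, cut_clean = adaptive_cutoff(clean, p, P_CRIT)
_, cut_bad = adaptive_cutoff(contaminated, p, P_CRIT)
n_fixed_clean = sum(v > delta for v in clean)
n_adaptive_clean = sum(v > cut_clean for v in clean)
n_adaptive_bad = sum(v > cut_bad for v in contaminated)
print(n_fixed_clean, n_adaptive_clean, n_adaptive_bad)
```

In this constructed example the fixed 97.5% cutoff labels the five largest values of the clean sample out-of-the-norm, the adaptive rule flags none of them, and all 20 planted values are flagged in the contaminated sample. That is the behavior the abstract describes: the threshold grows without bound when the data are clean and stays bounded when outliers are present.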

Similar resources

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data are often modeled using a vector autoregressive moving average (VARMA) model. However, the presence of outliers can violate the stationarity assumption and may lead to incorrect modeling, biased parameter estimates, and inaccurate predictions. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

Multivariate outlier detection in exploration geochemistry

A new method for multivariate outlier detection able to distinguish between extreme values of a normal distribution and values originating from a different distribution (outliers) is presented. To facilitate visualising multivariate outliers spatially on a map, the multivariate outlier plot is introduced. In this plot different symbols refer to a distance measure from the centre of the distrib...

Outlier Detection in Wireless Sensor Networks Using Distributed Principal Component Analysis

Detecting anomalies is an important challenge for intrusion detection and fault diagnosis in wireless sensor networks (WSNs). To address the problem of outlier detection in wireless sensor networks, in this paper we present a PCA-based centralized approach and a DPCA-based distributed energy-efficient approach for detecting outliers in sensed data in a WSN. The outliers in sensed data can be ca...

Local multivariate outliers as geochemical anomaly halos indicators, a case study: Hamich area, Southern Khorasan, Iran

Anomaly recognition has always been a prominent subject in preliminary geochemical exploration. Regional geochemical data processing draws on a range of statistical and data mining techniques, as well as different mapping methods that serve to present the outputs. Outlier values are of interest in investigations where data are gathered under controlled condition...

The Art of Data Visualization: Detecting Multivariate Data Outliers Using an Interactive Approach

Successfully detecting outliers in multivariate data requires statistical and programming skills and can be very time consuming. Requests for outlier detection can come from different skill groups; it is therefore more efficient and effective to allow users to interact directly with the data themselves. We have developed an interactive, web-based data visualization application for outlier detec...

Publication date: 2014